Finding canonical forms for historical German text
Abstract
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any technique or system requiring reference to a fixed lexicon accessed by orthographic form. This paper presents two methods for mapping unknown historical text types to one or more synchronically active canonical types: conflation by phonetic form, and conflation by lemma instantiation heuristics. Implementation details and evaluation of both methods are provided for a corpus of historical German verse quotation evidence from the digital edition of the Deutsches Wörterbuch.
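As a rough illustration of conflation by phonetic form, the idea can be sketched as follows: historical and contemporary forms are reduced to a shared key, and an unknown historical form is mapped to every lexicon entry with the same key. The rewrite rules, the function names (`phonetic_key`, `build_index`, `canonical_candidates`), and the sample lexicon are hypothetical and are not the phonetization described in the paper.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny rewrite rules for illustration only.
# They are NOT the phonetization rules used in the paper.
RULES = [
    (r"th", "t"),                # Theil -> Teil
    (r"ey", "ei"),               # seyn -> sein
    (r"y", "i"),
    (r"ß|ss", "s"),
    (r"(.)\1", r"\1"),           # collapse doubled letters
    (r"(?<=[aeiouäöü])h", ""),   # drop lengthening h after a vowel
]


def phonetic_key(word):
    """Reduce a word to a crude pseudo-phonetic key."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def build_index(lexicon):
    """Group contemporary lexicon entries by their phonetic key."""
    index = defaultdict(set)
    for entry in lexicon:
        index[phonetic_key(entry)].add(entry)
    return index


def canonical_candidates(historical_form, index):
    """Return all lexicon entries sharing a key with the historical form."""
    return sorted(index.get(phonetic_key(historical_form), set()))


index = build_index(["Teil", "Tier", "sein", "Jahr"])
print(canonical_candidates("Theyl", index))
print(canonical_candidates("seyn", index))
```

Because several lexicon entries can share one key, the lookup returns a list of candidates rather than a single form, which matches the abstract's "one or more" canonical types.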
Similar resources
Comparing Canonicalizations of Historical German Text
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon accessed by orthographic form. In this paper, we present three methods for associating unknown historical word forms with synchronica...
Constructing a Canonicalized Corpus of Historical German by Text Alignment ---draft
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon indexed by orthographic form. Canonicalization approaches seek to address these issues by assigning an extant equivalent to each word...
More than Words: Using Token Context to Improve Canonicalization of Historical German
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a fixed lexicon accessed by orthographic form, such as information retrieval systems (Sokirko, 2003; Cafarella and Cutting, 2004), part-of-speech tagg...
Manual and semi-automatic normalization of historical spelling - case studies from Early New High German
This paper presents work on manual and semi-automatic normalization of historical language data. We first address the guidelines that we use for mapping historical to modern word forms. The guidelines distinguish between normalization (preferring forms close to the original) and modernization (preferring forms close to modern language). Average inter-annotator agreement is 88.38% on a set of da...
Text Screening (Censorship) in Iran: A Historical Perspective
Censorship has a long history in Iran that has interfered with text production, i.e., original writing as well as translation. This phenomenon seems to have marked the borderline between the government and the ‘enlightened’ intellectuals throughout history in Iran. Different governments have delineated ‘redlines’ for authors and translators and dealt with these constructors of culture based on ...
Publication date: 2008